
Topological correctness #54

Merged
merged 20 commits into from
Oct 1, 2024

Conversation

dkhofer
Collaborator

@dkhofer dkhofer commented Sep 25, 2024

No description provided.

fixtures/aa.gfa Outdated
S 1 A SN:Z:123 SO:i:0 SR:i:0
S 2 A SN:Z:123 SO:i:0 SR:i:0
L 1 + 2 + *
P 124 1+,2+ 4M
Collaborator

I'm not sure what the parser would do, but this should be 1M (1 match).

Collaborator

Why? Shouldn't it be 0M for overlap of length 0 if it's representing the sequence AA?

Collaborator

Yeah, it should be. I knew 4M was way too much :)

Collaborator Author

Fixed
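For reference, a corrected version of the fixture with the overlap changed from 4M to 0M (per the thread above; the exact merged content is assumed, not quoted) would read:

```gfa
S	1	A	SN:Z:123	SO:i:0	SR:i:0
S	2	A	SN:Z:123	SO:i:0	SR:i:0
L	1	+	2	+	*
P	124	1+,2+	0M
```

In GFA, the last field of a P line holds CIGAR overlaps between consecutive path segments, so two abutting 1-bp segments spelling AA share a 0-length overlap: 0M.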

@@ -57,13 +64,16 @@ pub fn import_fasta(
.sequence(&sequence)
.save(conn)
};
let node = Node::create(conn, &seq.hash);
Collaborator

So if i import the same file 2x, we'll have duplicate nodes + edges made?

Collaborator Author

Yeah. It's not great, but I don't see it as blocking. I think properly detecting the duplicates in general would take some work, worth doing long term.

Collaborator

Is it that much of a problem on the fasta import side? When would people import the same fasta file and expect it to collapse onto the same nodes and edges? I can understand that on the vcf import you'd eventually want a heuristic that says something like "for a pair of edges (p, q), if there is already a pair of edges (a, b) with a shared node id (from the reference path) and identical coordinates on all sides, then check whether the new node has the same hash as the old node."

Collaborator Author

Probably not an issue for fasta, but definitely VCF and GFA, since GFA doesn't do anything to canonically label segments and links
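The dedup idea discussed here (key new edges on from_node_id + alt seq + to_node_id so re-importing the same VCF/GFA is idempotent) can be sketched in Rust. This is an illustrative in-memory model, not the repo's actual API; the `EdgeKey`/`EdgeStore` names are hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical dedup key: an edge is identified by its endpoints plus
// the hash of the alternate sequence it carries.
#[derive(Hash, PartialEq, Eq, Clone)]
struct EdgeKey {
    from_node_id: i64,
    alt_seq_hash: String,
    to_node_id: i64,
}

struct EdgeStore {
    by_key: HashMap<EdgeKey, i64>, // key -> existing edge id
    next_id: i64,
}

impl EdgeStore {
    fn new() -> Self {
        EdgeStore { by_key: HashMap::new(), next_id: 1 }
    }

    /// Returns the existing edge id if the same (from, seq, to) triple
    /// was already inserted; otherwise assigns a fresh id. Re-importing
    /// identical data therefore creates no duplicates.
    fn get_or_create(&mut self, key: EdgeKey) -> i64 {
        if let Some(&id) = self.by_key.get(&key) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_key.insert(key, id);
        id
    }
}

fn main() {
    let mut store = EdgeStore::new();
    let key = EdgeKey {
        from_node_id: 1,
        alt_seq_hash: "abc".to_string(),
        to_node_id: 2,
    };
    let first = store.get_or_create(key.clone());
    let second = store.get_or_create(key); // same triple -> same id
    assert_eq!(first, second);
}
```

In the database this would likely become a UNIQUE constraint over those columns plus an upsert, rather than an application-side map.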

src/models/node.rs Outdated
INSERT INTO sequence (hash, sequence_type, sequence, name, file_path, "length") values ("start-node-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", "OTHER", "start-node-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", "", "", 64), ("end-node-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", "OTHER", "end-node-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", "", "", 64);
INSERT INTO nodes (id, sequence_hash) values (1, "start-node-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy");
INSERT INTO nodes (id, sequence_hash) values (2, "end-node-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
UPDATE SQLITE_SEQUENCE SET seq = 2 WHERE name = 'nodes';
Collaborator

is this still needed?

Collaborator Author

I guess not, so that's nice. Removed

sequence_hash
);
let _ = conn.execute(&insert_statement, ());
conn.last_insert_rowid() as i32
Collaborator

I have a bad feeling about using this instead of RETURNING id. It seems like we're opening ourselves up to race conditions around this connection session and insertion execution order.

Collaborator Author

My bad, I let autopilot write a lot of this code and I zoned out on this. Changed to use RETURNING.
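A minimal sketch of the fix, assuming SQLite 3.35+ (which added the RETURNING clause) and the `nodes` schema shown earlier; the new row's id comes back from the INSERT itself instead of a follow-up `last_insert_rowid()` call whose result depends on session state and statement ordering:

```sql
-- Hypothetical rewrite: ?1 is the sequence hash parameter.
INSERT INTO nodes (sequence_hash) VALUES (?1) RETURNING id;
```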

Collaborator

@Chris7 Chris7 left a comment

Biggest concern is the last_id thing biting us in the ass if we add concurrency eventually. I don't feel good about getting the id by an is-probably-right mechanism vs. one we know to be right.

Eventually, we'll need to see how to make node creation stateless wrt uniqueness. I think that using from_node_id + alt seq + to_node_id may accomplish that. The thought is that we should expect multiple vcfs to be loaded over time, with overlap between each vcf file, and we shouldn't be duplicating those entries every time. I think that would also dedupe fastas (hashing on from node (source), seq, to node (sink), and name may get it done; maybe we need collection too).

@dkhofer
Copy link
Collaborator Author

dkhofer commented Oct 1, 2024

Definitely agree that we need the deduping. I'm thinking about it more and am open to your suggestion about from_node_id + alt seq + to_node_id, but I need to convince myself that that's all that's needed. I don't have a specific example in mind but I feel like duplicates could pop up more generally, and maybe we need something stronger to handle more general cases.

@dkhofer dkhofer merged commit 85f0940 into main Oct 1, 2024
1 check passed
@dkhofer dkhofer deleted the topological-correctness branch October 1, 2024 16:27